Better Filtering with Gapped q-Grams
نویسندگان
چکیده
A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results the arrangement of the gaps in the q-gram and a filter parameter called threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e, approximate string matching with the Hamming distance.
منابع مشابه
One-Gapped q-Gram Filtersfor Levenshtein Distance
We have recently shown that q-gram filters based on gapped q-grams instead of the usual contiguous q-grams can provide orders of magnitude faster and/or more efficient filtering for the Hamming distance. In this paper, we extend the results for the Levenshtein distance, which is more problematic for gapped q-grams because an insertion or deletion in a gap affects a q-gram while a replacement do...
متن کاملComputing the Threshold for q-Gram Filters
A popular and much studied class of filters for approximate string matching is based on finding common q-grams, substrings of length q, between the pattern and the text. A variation of the basic idea uses gapped q-grams and has been recently shown to provide significant improvements in practice. A major difficulty with gapped q-gram filters is the computation of the so-called threshold which de...
متن کاملProbeMatch: rapid alignment of oligonucleotides to genome allowing both gaps and mismatches
SUMMARY We have developed a tool, called ProbeMatch, for matching a large set of oligonucleotide sequences against a genome database using gapped alignments. Unlike most of the existing tools such as ELAND which only perform ungapped alignments allowing at most two mismatches, ProbeMatch generates both ungapped and gapped alignments allowing up to three errors including insertion, deletion and ...
متن کاملOn the complexity of finding gapped motifs
A gapped pattern is a sequence consisting of regular alphabet symbols and of joker symbols that match any alphabet symbol. The content of a gapped pattern is defined as the number of its non-joker symbols. A gapped motif is a gapped pattern that occurs repeatedly in a string or in a set of strings. The aim of this paper is to study the complexity of several gapped motif finding problems. The fo...
متن کاملPattern Matching in BWT-Transformed Text
The BWT performs a permutation of the characters in the text, such that characters in lexically similar contexts will be near to each other. The motivation for our approach is the observation that the BWT provides a lexicographic ordering of the input text as part of its inverse transformation process. The decoder only has limited information about the sorted context, but it may be possible to ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Fundam. Inform.
دوره 56 شماره
صفحات -
تاریخ انتشار 2001